Large Language Models (LLMs) are known for their extensive computational requirements. This is due to their large number of parameters, stored in matrices that are multiplied to produce an output.
Typically, the size of a model is calculated by multiplying the number of parameters by the precision of these values (the data type). However, to save memory, weights can be stored using lower-precision data types through a process known as quantization.
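As a rough back-of-the-envelope example (GPT-2's parameter count of roughly 124M is an approximation), this calculation looks like:

```python
# Model size = number of parameters x bytes per parameter
n_params = 124_000_000      # GPT-2 has roughly 124M parameters
size_fp32 = n_params * 4    # FP32 stores each weight in 4 bytes
size_int8 = n_params * 1    # INT8 stores each weight in 1 byte

print(f"FP32: ~{size_fp32 / 1e6:.0f} MB, INT8: ~{size_int8 / 1e6:.0f} MB")
```

Moving from FP32 to INT8 thus divides the storage cost by four, before accounting for any extra metadata such as scaling factors.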
In this article, we will see how to reduce the precision of these parameters while maintaining good performance. We will summarize the latest spectacular improvements in this field and apply them to a toy example using a GPT-2 model.
To give a bit of background, we distinguish two main families of weight quantization techniques in the literature:
Post-Training Quantization (PTQ) is a straightforward technique where the weights of an already trained model are converted to lower precision without necessitating any retraining. Although easy to implement, PTQ is associated with potential performance degradation.
Quantization-Aware Training (QAT) incorporates the weight conversion process during the pre-training or fine-tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data.
In this article, we focus on PTQ to reduce the precision of our parameters. Common precisions, also known as “floating point data types,” include float32 (FP32), float16 (FP16), and bfloat16 (BF16):
FP32 stands for the standardized IEEE 32-bit floating point representation, allowing for a vast range of floating numbers. This format assigns 8 bits for the exponent, 23 bits for the mantissa, and 1 bit for the sign of the number. Most hardware supports FP32 operations and instructions.
FP16, on the other hand, reserves 5 bits for the exponent and 10 bits for the mantissa. This considerably reduces the representable range of FP16 numbers compared to FP32, and exposes FP16 numbers to the risk of overflowing and underflowing.
BF16 was created to mitigate the limitations of FP16. It reserves 8 bits for the exponent (as in FP32) and 7 bits for the mantissa, maintaining the same dynamic range as FP32 but with less precision than FP16.
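We can make this range difference concrete with PyTorch dtype casts: 70,000 is easily representable in FP32 and BF16, but exceeds FP16's maximum value of about 65,504, so it overflows:

```python
import torch

x = torch.tensor(70000.0)      # easily representable in FP32
print(x.to(torch.float16))     # overflows to inf (FP16 max is ~65504)
print(x.to(torch.bfloat16))    # finite, but stored with fewer mantissa bits
```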
In ML jargon, FP32 is often termed “full precision” (4 bytes), while BF16 and FP16 are “half-precision” (2 bytes). Additionally, the int8 (INT8) data type consists of an 8-bit representation capable of storing 2^8 = 256 different values.
Let’s see how to convert FP32 weights into an INT8 format.
Naive 8-bit Quantization
In quantization, the original data is “rounded” from one data type to another, leading to a lossy compression and potential information loss. Two popular 8-bit quantization techniques are zero-point quantization and absolute maximum (absmax) quantization. These techniques map floating-point values into the more compact int8 (1 byte) values.
With zero-point quantization, the input range is mapped asymmetrically: values are scaled by a factor of 255 divided by the difference between the maximum and minimum values, then shifted by a “zero-point” so that the minimum input lands on -128, and rounded to the nearest 8-bit value. To retrieve the original value, the zero-point is subtracted and the result is divided by the same scale factor.
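A minimal sketch of this scheme in PyTorch (the function name zeropoint_quantize and the exact clipping choices are illustrative, not a library API):

```python
import torch

def zeropoint_quantize(X):
    # Scale maps the full range of X onto the 256 values of int8
    value_range = torch.max(X) - torch.min(X)
    value_range = 1 if value_range == 0 else value_range
    scale = 255 / value_range

    # Shift so that the minimum input value lands on -128
    zeropoint = (-scale * torch.min(X) - 128).round()

    # Quantize: scale, shift, round, and clip to the int8 range
    X_quant = torch.clip((X * scale + zeropoint).round(), -128, 127)

    # Dequantize: undo the shift and the scaling
    X_dequant = (X_quant - zeropoint) / scale
    return X_quant.to(torch.int8), X_dequant
```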
Absmax quantization operates slightly differently. To map an FP16 number to an int8 number, the original number is divided by the absolute maximum value of the tensor and multiplied by 127, the largest value of the int8 range. To retrieve the original FP16 values, the int8 number is divided by the same quantization factor, acknowledging some loss of precision due to rounding.
These quantization techniques can reduce the size of models dramatically while preserving most of their performance, making them valuable tools in efficient deployment of machine learning models.
Let’s implement these techniques using the transformers library. We start by loading the model and tokenizer for GPT-2 from the Hugging Face Hub. We want to observe the model’s size before and after quantization to evaluate the potential memory savings.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(42)

# Set device to CPU for now
device = 'cpu'

# Load model and tokenizer
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Model size: {model.get_memory_footprint():,} bytes")
Model size: 510,342,192 bytes
We want to quantize these weights. We create a function that computes the absolute maximum of the tensor, which is used as a scaling factor to normalize the tensor values. The normalized values are then rounded to the nearest integer and stored in int8 format. The function also returns a dequantized version of the tensor for comparison, where the quantized tensor is scaled back by the original absolute maximum.
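A minimal sketch of this function, matching the absmax_quantize() call used later in this article:

```python
import torch

def absmax_quantize(X):
    # Scale so that the largest-magnitude value maps to 127
    scale = 127 / torch.max(torch.abs(X))

    # Quantize: scale and round to the nearest integer
    X_quant = (scale * X).round()

    # Dequantize: divide by the same scale to approximate the original values
    X_dequant = X_quant / scale
    return X_quant.to(torch.int8), X_dequant
```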
In the following example, we apply it to the first attention layer of the GPT-2 model.
We clearly see the difference between the original tensors (floats) and the quantized ones (integers between -128 and 127).
After that, we define a function generate_text(), which we will use to compare the text generated by the original and quantized models. To quantize the model, we apply the absmax_quantize() function to all of its weights. Note that we’re replacing the original weights with the dequantized ones: computing gradients requires floating-point values. In a real scenario, we would dequantize them to run the model (in FP16, for example) but store them as INT8.
import numpy as np

# Define the text generation function
def generate_text(model, input_text, max_length=50):
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long).to(device)
    pad_token_id = tokenizer.eos_token_id
    output = model.generate(inputs=input_ids,
                            max_length=max_length,
                            do_sample=True,
                            temperature=0.7,
                            attention_mask=attention_mask,
                            pad_token_id=pad_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with original model
original_text = generate_text(model, "I have a dream")

# Quantize all model weights
weights = []
weights_quant = []
for p in model.parameters():
    weights.append(p.data)
    _, dequantized = absmax_quantize(p.data)
    p.data = dequantized
    weights_quant.append(dequantized)

# Generate text with quantized model
quantized_text = generate_text(model, "I have a dream")
Before we print the generated text, let’s check the impact of quantization on the weights. Intuitively, we want the quantized weights to stay close to the original ones. One way to check this is to plot the distributions of the dequantized and original weights: a very lossy quantization would drastically change the weight distribution.
The following figure shows this comparison, where the blue histogram represents the original (FP32) weights, and the red one represents the dequantized (from INT8) weights. Note that we only display this plot between -2 and 2.
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Flatten weight tensors
weights = np.concatenate([t.cpu().numpy().flatten() for t in weights])
weights_quant = np.concatenate([t.cpu().numpy().flatten() for t in weights_quant])

# Set background style
plt.style.use('ggplot')

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 5), dpi=300)

# Plot the histograms
ax.hist(weights, bins=150, alpha=0.5, label='Original FP32 weights',
        color='blue', range=(-2, 2))
ax.hist(weights_quant, bins=150, alpha=0.5, label='Dequantized INT8 weights',
        color='red', range=(-2, 2))

# Add grid
ax.grid(True, linestyle='--', alpha=0.6)

# Add legend
ax.legend()

# Add title and labels
ax.set_title('Comparison of Original and Dequantized Weights', fontsize=16)
ax.set_xlabel('Weights', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
plt.gca().yaxis.set_major_formatter(ticker.EngFormatter())  # Make y-ticks more human readable

# Improve font
plt.rc('font', size=12)
plt.tight_layout()
plt.show()
We observe a surprising spike around 0. This spike shows that our quantization is quite lossy: reversing the process doesn’t recover the original values.
Let’s verify that by printing the output of each model.
Original model: I have a dream.
On that day, I made a joke, like, 'Fuck you, I'm going to do this for you.'
I was getting up at 7:45 in the morning, and I was like,
Quantized model: I have a dream of getting to know your enemy, but it is impossible. I have a dream of getting to know your enemies. But it is impossible. Why do you think I am capable of doing this? Because feathers are my best friend.
None of the outputs are particularly good, but it feels like the second one is very random. We can try to quantify this intuition by calculating the perplexity of each output. Perplexity is a common metric used to evaluate language models. It measures the uncertainty of a model in predicting the next token in a sequence.
We implement it using a quick function since it doesn’t need to consider details like the length of the context window here (our sentences are short).
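Here is one possible sketch (the name calculate_perplexity and its signature are my choice; it simply exponentiates the model's average cross-entropy loss on the text):

```python
import torch

def calculate_perplexity(model, tokenizer, text, device='cpu'):
    # Encode the text and use the same token ids as labels
    encodings = tokenizer(text, return_tensors='pt').to(device)
    input_ids = encodings.input_ids

    # The model returns the average cross-entropy loss over predicted tokens
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)

    # Perplexity is the exponential of the cross-entropy loss
    return torch.exp(outputs.loss)
```

Calling it on original_text and quantized_text with the corresponding models gives the two scores we compare below.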
As expected, the perplexity of the first output is much lower than the second one. We could repeat this process for multiple generations and find similar results. This shows that the quality of the generated text dropped because of the quantization process.
In the next section, we will see how we can do better using a more efficient INT8 quantization method.
8-bit Quantization with LLM.int8()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map='auto',
                                             load_in_8bit=True)
print(f"Model size: {model.get_memory_footprint():,} bytes")

# Generate text with quantized model
text_llm_int8 = generate_text(model, "I have a dream")
print(text_llm_int8)
Model size: 261,462,552 bytes
I have a dream. I don't know what it is, but I have a dream. I have a dream. I have a dream. I have a dream. I have a dream. I have a dream. I have a dream. I
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
!pip install -q auto-gptq

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map='auto',
                                             load_in_4bit=True)
print(f"Model size: {model.get_memory_footprint():,} bytes")

# Generate text with quantized model
text_nf4 = generate_text(model, "I have a dream")
print(text_nf4)
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
Model size: 261,462,552 bytes
I have a dream. I have a dream of having a family, of having a job. And I do it all with love. I love my love. And I don't have a dream anymore. Don't be afraid to be a good person
examples = [
    "What's the weather like today in New York City? I'm planning to visit Central Park in the afternoon.",
    "Hey, check out this article on the benefits of a plant-based diet: https://www.example.com/plant-based-diet",
    "Can you recommend any good science fiction books? I love stories about time travel and space exploration.",
    "How can I learn a new language quickly? I'm planning to move to Spain next year and need to learn Spanish.",
    "Just finished watching The Matrix. What are some other popular sci-fi movies to watch this weekend?",
    "I'm feeling stressed lately. What are some effective ways to deal with stress and improve my mental health?",
    "What are the top tourist attractions in Paris? I'll be visiting the city for a week and want to make the most of my time there.",
    "Tell me a joke about computers. I need something to cheer me up after a long day at work.",
    "How do I cook spaghetti carbonara? Can you share a simple recipe that I can follow at home?",
    "Can you give me a brief summary of the latest news? I haven't had the chance to catch up on current events.",
    "What's the difference between a psychologist and a psychiatrist? I'm considering therapy but not sure which one to see.",
    "I'm planning to start my own online store. What are the steps to start a small business and make it successful?",
]
examples = [tokenizer(text, truncation=True) for text in examples]

model.quantize(
    examples,
    use_triton=True,
    autotune_warmup_after_quantized=True,
    batch_size=1,
)
WARNING:auto_gptq.modeling._utils:using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model.
100%|██████████| 11/11 [03:16<00:00, 17.84s/it]
text_nf4 = generate_text(model, "I have a dream")
print(text_nf4)
I have a dream job. I'm looking forward to being at the top of the standings. No excuses. I have to stay where I am and try to win as many games as I can. I'm just trying to be more than the team
Belkada and Dettmers, A Gentle Introduction to 8-bit Matrix Multiplication, Hugging Face Blog (2022): Blog post that introduces the main ideas behind LLM.int8() and how to use it in the Hugging Face ecosystem.
Weng, Lilian, Large Transformer Model Inference Optimization, Lil’Log (2023): Exhaustive overview of different techniques to optimize inference, including distillation and weight pruning.
Czarnogorski, Kamil, Local Large Language Models, Int8 (2023): Great overview of different methods to run LLMs on your local hardware.